This version includes the data processing and the complete Cox model.

library(dplyr)
library(GEOquery) 
library(tidyr)
library(survival)
library(glmnet)
library(survminer)
library(caret)

1 Load Data and data processing

extract_gene_symbol <- function(gene_assignment) {
  # Split the string by //
  parts <- strsplit(gene_assignment, " // ")[[1]]
  
  # Return the second element (gene symbol)
  if (length(parts) >= 2) {
    return(trimws(parts[2]))
  } else {
    return(NA)
  }
}

extract_gse_data <- function(gse) {
  # filteredFeatureData <- fData(gse) %>% filter(gene_assignment != "---")
  # filteredFeatureData$gene_symbol <- sapply(filteredFeatureData$gene_assignment, extract_gene_symbol)
  return(list(
    featureData = fData(gse),
    phenoData = pData(gse),
    eMat = exprs(gse)
  ))
}

GSE78229 <- getGEO("GSE78229")$GSE78229_series_matrix.txt.gz %>% extract_gse_data()
GSE28735 <- getGEO("GSE28735")$GSE28735_series_matrix.txt.gz %>% extract_gse_data()
GSE62452 <- getGEO("GSE62452")$GSE62452_series_matrix.txt.gz %>% extract_gse_data()
find_common_genes <- function(...) {
  matrices <- list(...)
  common_genes <- Reduce(intersect, lapply(matrices, rownames))
  return(common_genes)
}

common_genes <- find_common_genes(GSE28735$eMat, GSE62452$eMat)

# subset_eMat_GSE78229 <- GSE78229$eMat[common_genes, ]
subset_eMat_GSE28735 <- GSE28735$eMat[common_genes, ]
subset_eMat_GSE62452 <- GSE62452$eMat[common_genes, ]

#combined_matrix <- cbind(subset_eMat_GSE78229, subset_eMat_GSE28735, subset_eMat_GSE62452)

combined_matrix <- cbind(subset_eMat_GSE28735, subset_eMat_GSE62452)

# print(dim(combined_matrix))
GSE28735$phenoData$is_dead <- as.numeric(
  replace(
    GSE28735$phenoData$`cancer_death:ch1`, 
    GSE28735$phenoData$`cancer_death:ch1` == "na", NA
    )
  )

non_tumor_rows <- grepl("non-tumor tissue", GSE28735$phenoData$`source_name_ch1`)

filtered_tumor <- GSE28735$phenoData[!non_tumor_rows, ]

filtered_28735 <- filtered_tumor[!is.na(filtered_tumor$is_dead), ] %>% rename(months_survived = `survival_month:ch1`)

# nrow(filtered_28735)
GSE62452$phenoData$is_dead <- as.numeric(
  replace(
    GSE62452$phenoData$`survival status:ch1`,
    GSE62452$phenoData$`survival status:ch1` %in% c("na", "?"), NA
    )
  )

tumor_rows <- grepl("Pancreatic tumor", GSE62452$phenoData$`tissue:ch1`)

filtered_tumor <- GSE62452$phenoData[tumor_rows, ]

filtered_62452 <- filtered_tumor[!is.na(filtered_tumor$is_dead), ] %>%
  rename(months_survived = `survival months:ch1`)

# nrow(filtered_62452)
combined_pheno <- bind_rows(filtered_28735, filtered_62452)
filtered_matrix <- combined_matrix[, colnames(combined_matrix) %in% rownames(combined_pheno)]

# dim(filtered_matrix)

gse <- list(
  featureData = GSE28735$featureData,
  phenoData = combined_pheno,
  eMat = filtered_matrix
)

2 Cox Model

2.1 Choose λ

# Construct survival object: assume that 'months_survived' in gse$phenoData is the survival time (numeric) and 'is_dead' is the event status (1 = death, 0 = survival)
surv_obj <- with(gse$phenoData, Surv(as.numeric(months_survived), as.numeric(is_dead)))

## Penalized Cox Model
# Build the prediction model using all 28869 genes 
# Prepare the prediction matrix x: the columns in gse$eMat represent samples; after transposition, rows represent samples and columns represent features
x <- t(gse$eMat)
x <- as.matrix(x)

# Fit the Cox model with Lasso penalty using cross-validation
set.seed(3888)
cvfit <- cv.glmnet(x, surv_obj, family = "cox", alpha = 1)
plot(cvfit, main = "Cross-validation for Penalized Cox Model")

best_lambda <- cvfit$lambda.min

Horizontal axis: Log(λ). When λ gets larger (to the left), the penalty becomes stronger and the model becomes more sparse. When λ gets smaller (to the right), the penalty becomes weaker and more genes are kept.

Vertical axis: Partial Likelihood Deviance. A smaller value means the model fits better.

Each red dot shows the average deviance for a λ value. The gray lines show the range of error across different folds.

The left dashed line shows λ.min, where the error is the smallest and the model fits the best.

The right dashed line shows λ.1se, which gives a slightly stronger penalty. It makes the model simpler and uses fewer genes.

We chose the λ.min on the left dashed line, which is 0.2227.

cat("Best lambda:", best_lambda, "\n")
## Best lambda: 0.256105

Through model calculation, the optimal λ value is 0.256105. This is the best choice for the Lasso penalty strength. Under this penalty, the model automatically shrinks the coefficients of many genes to zero, keeping only a small number of genes that are significantly related to survival.

2.2 Build Cox model and visualization

# Fit the final model using the optimal lambda
final_model <- glmnet(x, surv_obj, family = "cox", alpha = 1, lambda = best_lambda)
coef_final <- coef(final_model)

cat("Non-zero coefficients in the final model:\n")
## Non-zero coefficients in the final model:
nz_idx <- which(coef_final != 0)
df <- data.frame(gene_id = colnames(x)[nz_idx],
  coef    = as.numeric(coef_final[nz_idx]))
print(df)
##    gene_id         coef
## 1  7935116 -0.003537639
## 2  7946013 -0.382954574
## 3  7959574  0.293692359
## 4  7969815 -0.109893861
## 5  7973084 -0.163110615
## 6  7976852 -0.059143098
## 7  7983991 -0.041335595
## 8  7991465  0.536564415
## 9  8007603  0.345199673
## 10 8007904  0.129473714
## 11 8033809 -0.146641065
## 12 8050071 -0.203072016
## 13 8063636  0.101586816
## 14 8072242  0.126808505
## 15 8077781  0.257208263
## 16 8093624  0.032404711
## 17 8117965  0.002033266
## 18 8118086  0.065666610
## 19 8124998  0.050022930
## 20 8126288  0.192758339
## 21 8133549 -0.183834898
## 22 8172280  0.182825218

After training the model, a total of 22 genes were selected.

The absolute value of each gene’s coefficient reflects its contribution to the risk score: the larger the absolute value, the higher the weight of the gene in the model and the more important it is for distinguishing prognosis.

This can be seen more clearly in the figure.

library(ggplot2)
ggplot(df, aes(x = reorder(gene_id, coef), y = coef, fill = coef > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("red","blue"), labels = c("high-risk gene","protective gene")) +
  labs(x = NULL, y = "coef", title = "Lasso-Cox gene id") +
  theme_minimal()

A positive coefficient means that higher expression of the gene is associated with a higher risk of death, indicating a high-risk gene. This suggests that the gene may be promoting disease progression.

A negative coefficient means that higher expression of the gene is associated with a lower risk of death, indicating a protective gene. These genes may help slow down disease progression.

2.3 Result visualization

# Compute the risk scores for each sample (linear predictor)
risk_scores <- predict(final_model, newx = x, type = "link")
# Divide the samples into high-risk and low-risk groups by using the median risk score as the cutoff
risk_group <- ifelse(risk_scores > median(risk_scores), "High", "Low")

# Add the risk group information to the original phenotype data
gse$phenoData$risk_group <- factor(risk_group, levels = c("Low", "High"))

# Use the Kaplan-Meier method to compare the survival curves of the two risk groups
fit_km <- survfit(surv_obj ~ risk_group, data = gse$phenoData)
ggsurvplot(fit_km, data = gse$phenoData, pval = TRUE, risk.table = TRUE, 
           title = "Kaplan-Meier Survival Curves by Risk Group")

Risk grouping: Patients were divided into “low-risk group” (red) and “high-risk group” (cyan) based on the median risk score of all samples.

Kaplan–Meier curve: The horizontal axis shows follow-up time (months), and the vertical axis shows survival probability.

The red curve stays above the cyan curve at all times, indicating that patients in the low-risk group have a much higher survival rate compared to those in the high-risk group.

Log-rank test p-value: p < 0.0001, showing that the difference between the two survival curves is statistically highly significant. This means that the risk score based on the 29 selected genes can clearly distinguish between good and poor prognosis.

Risk table (Number at risk): Shown below the curves, it displays the number of patients still under follow-up at each time point, helping to evaluate the reliability of survival estimates in later stages.